I am very excited about this course. I think I will learn a lot of interesting and very useful things. I study statistics, so I want this course to be a bit more practical than my other studies.
Johanna Tuhkanen recommended this course to me. I also got a letter from Kimmo Vehkalahti advertising it.
students14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
head(students14)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
str(students14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students14)
## [1] 166 7
The data was collected in a survey on teaching and learning, and it contains the answers of 166 students. Each observation includes the respondent’s age and gender, four variables measuring the student’s deep, surface and strategic approaches to learning and his/her attitude towards statistics, and a final variable, “points”, giving the exam score.
library(ggplot2)
library(GGally)
summary(students14)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
ggpairs(students14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
ggpairs(students14, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
Let us first describe the distributions of the variables. Then we will take a look at their connection to each other, especially to the “points” variable.
The “gender” variable is nominal: it takes only two values, “F” for female and “M” for male. According to the plot, more women than men answered this survey. We also notice that, on average, our respondents are quite young.
From the raw numbers alone it is very hard to notice any correlations between the variables. Fortunately, the plot helps us. The strongest correlation is between the “attitude” and “points” variables. Additionally, there is a moderate negative correlation (about -0.3) between the deep and surface approaches.
The strategic and surface approaches correlate with points as well. I think that those variables and “attitude” should be chosen as explanatory variables for a regression model where exam points are the response variable.
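These correlation claims can also be checked numerically; a small sketch (re-reading the data so the chunk runs on its own) that prints the correlation matrix of the numeric variables:

```r
# Read the same data as above and compute pairwise correlations
students14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
                         sep = ",", header = TRUE)
num_vars <- students14[, c("attitude", "deep", "stra", "surf", "points")]
cm <- cor(num_vars)
round(cm, 2)  # "attitude" vs "points" shows the strongest off-diagonal value
```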
Let us start with this model: y_i = b_0 + b_1*x_(attitude, i) + b_2*x_(stra, i) + b_3*x_(surf, i) + e_i
We are going to fit it now.
model <- lm(points ~ attitude + stra + surf, data = students14)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
So we get the following results:

- The regression coefficients are 3.4 for “attitude”, 0.9 for “stra” and -0.6 for “surf”.
- However, the p-values for “stra” and “surf” are large (> 0.1), so those coefficients may well be zero. These two variables do not have a statistically significant relationship with the target variable.
- The p-value for “attitude” is almost zero, which means this variable has a statistically significant relationship with the “points” variable.
Let us remove the statistically non-significant variables and fit the model again.
model <- lm(points ~ attitude, data = students14)
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = students14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In this model, we have only the “attitude” variable. The regression coefficient is 3.5, meaning that when a student’s attitude towards statistics increases by one point, the expected exam score increases by 3.5 points. The multiple R-squared of this model is 0.19, so the “attitude” variable explains about 19% of the variation in the “points” variable.
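To make the interpretation concrete, here is a tiny sketch using the estimates copied from the summary above (the numbers are taken from the output, not re-estimated):

```r
# Coefficients copied from the model summary above
b0 <- 11.6372  # intercept
b1 <- 3.5255   # slope for attitude
# Expected exam points at attitude levels 2, 3 and 4
pred <- b0 + b1 * c(2, 3, 4)
pred
diff(pred)  # each one-point increase in attitude adds b1 points
```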
When we use this model, we make the following assumptions: 1. the relationship between the explanatory variable and the target variable is linear, and 2. the errors e_i are independent and normally distributed with mean 0 and constant variance.
The diagnostic plots we draw next will help us assess these assumptions.
par(mfrow = c(2,2))
plot(model, which = c(1,2,5))
The Normal Q-Q plot shows little evidence of a departure from the reference line, so there is no evidence that the residuals are not normally distributed.
The “Residuals vs Fitted” plot does not show the variance of the residuals increasing or decreasing with the fitted values, so we can say that the variance of e_i is constant.
Leverage measures how much impact a single observation has on the model. The largest leverage among the observations is only about 0.04, which is very small, so we can say that this data does not have any outliers that would strongly affect the model.
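In R the leverages come from hatvalues(); a quick self-contained sketch on a built-in dataset (the same call works on the model object above):

```r
# hatvalues() returns the leverage of every observation of a fitted model
toy <- lm(dist ~ speed, data = cars)  # built-in 'cars' data
lev <- hatvalues(toy)
max(lev)        # the largest leverage
which.max(lev)  # the observation with the most leverage
# for a simple regression the leverages always sum to 2, the number of
# estimated coefficients
sum(lev)
```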
alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = "," , header=TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This data is about the students of two Portuguese schools. It contains basic information about the students, such as gender and age, and also describes their alcohol consumption.
The purpose of my analysis is to study the relationships between high/low alcohol consumption and the following variables:

1. the student’s age
2. the student’s school
3. the quality of the student’s family relationships
4. the student’s plans about taking higher education
I think that age might have a significant effect on alcohol consumption: I hypothesize that as a student becomes older, he or she drinks more alcohol. I am also interested to know whether there is any difference in alcohol consumption between the two schools; it might be that one of them is located in a less well-off area, which makes alcohol consumption higher. My hypothesis about the effect of family relationships is that students who have a good relationship with their family are less prone to drink than students whose relationships are not that good. My fourth explanatory variable is the student’s plans about taking higher education: I think that a student who is planning to study further focuses more on his or her studies and drinks less than others.
Let us start with the variable “age”.
library(ggplot2)
ggplot(data = alc, aes(x = age, fill = high_use)) + geom_bar()
What we see is that there is a peak at the age of 17.
The second variable is school.
ggplot(data = alc, aes(x = school, fill = high_use)) + geom_bar()
It is quite hard to say whether alcohol consumption is higher in one of the schools. In our further analysis, it might turn out that this variable and our response variable are independent.
The next graph is about alcohol consumption and family.
ggplot(data = alc, aes(x = famrel, fill = high_use)) + geom_bar()
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
t <- table(family = alc$famrel, high_use = alc$high_use)
t
## high_use
## family FALSE TRUE
## 1 7 2
## 2 9 9
## 3 40 26
## 4 131 52
## 5 83 23
for (i in 1:nrow(t)) {
sum <- t[i,1] + t[i,2]
t[i, 1] <- t[i,1]/sum
t[i, 2] <- t[i,2]/sum
}
t
## high_use
## family FALSE TRUE
## 1 0.7777778 0.2222222
## 2 0.5000000 0.5000000
## 3 0.6060606 0.3939394
## 4 0.7158470 0.2841530
## 5 0.7830189 0.2169811
Based on this last table, higher “famrel” values are generally associated with a smaller share of high alcohol use, so there seems to be a negative relationship between this variable and the response variable.
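Incidentally, the row-normalising loop above can be written in one step with prop.table(); a self-contained sketch using the counts copied from the table:

```r
# Counts copied from the famrel vs high_use table above
t_counts <- matrix(c(7, 9, 40, 131, 83,   # FALSE column
                     2, 9, 26, 52, 23),   # TRUE column
                   ncol = 2,
                   dimnames = list(family = as.character(1:5),
                                   high_use = c("FALSE", "TRUE")))
p <- prop.table(t_counts, margin = 1)  # margin = 1 normalises each row
p
```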
Also, let us see whether students who want to continue their studies are less likely to be heavy drinkers.
ggplot(data = alc, aes(x = higher, fill = high_use)) + geom_bar()
t <- table(family = alc$higher, high_use = alc$high_use)
t
## high_use
## family FALSE TRUE
## no 9 9
## yes 261 103
for (i in 1:nrow(t)) {
sum <- t[i,1] + t[i,2]
t[i, 1] <- t[i,1]/sum
t[i, 2] <- t[i,2]/sum
}
t
## high_use
## family FALSE TRUE
## no 0.500000 0.500000
## yes 0.717033 0.282967
It seems that my hypothesis holds quite well.
Now it is time to use logistic regression to statistically explore the relationship between these variables and the binary high/low alcohol consumption variable as the target.
model <- glm(high_use ~ age + school + famrel + higher, data = alc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ age + school + famrel + higher, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3740 -0.8320 -0.7316 1.3546 1.8605
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.01814 1.96224 -1.028 0.3037
## age 0.17710 0.10748 1.648 0.0994 .
## schoolMS -0.01764 0.38515 -0.046 0.9635
## famrel -0.29855 0.12189 -2.449 0.0143 *
## higheryes -0.68147 0.50601 -1.347 0.1781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 449.89 on 377 degrees of freedom
## AIC: 459.89
##
## Number of Fisher Scoring iterations: 4
Based on the results we got from fitting the model, only the variable “famrel” is statistically significant.
The odds ratios of the coefficients and their 95% confidence intervals are below:
exp(cbind(coef(model), confint(model)))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.1329027 0.002799011 6.2614310
## age 1.1937562 0.967246744 1.4759900
## schoolMS 0.9825138 0.452417639 2.0657936
## famrel 0.7418952 0.583097300 0.9419763
## higheryes 0.5058743 0.185373265 1.3859659
Here we have further evidence that only “famrel” is statistically significant: its confidence interval does not contain 1, so it does have an effect on alcohol consumption. The other variables’ confidence intervals contain 1, so there is no evidence that they really affect our response variable.
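As a sanity check, the odds ratio is just the exponentiated coefficient; a sketch using the famrel estimate copied from the model summary above:

```r
# Log-odds coefficient of famrel copied from the model summary above
b_famrel <- -0.29855
or_famrel <- exp(b_famrel)
or_famrel  # about 0.742, matching the first column of the table
# i.e. each one-point improvement in family relationships multiplies the
# odds of high alcohol use by roughly 0.74
```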
Now we are going to create a model that contains only “famrel” as an explanatory variable.
model <- glm(high_use ~ famrel, data = alc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ famrel, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1631 -0.8220 -0.7244 1.4489 1.7125
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2587 0.4744 0.545 0.5855
## famrel -0.2925 0.1197 -2.445 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 456.24 on 380 degrees of freedom
## AIC: 460.24
##
## Number of Fisher Scoring iterations: 4
So, let us explore the predictive power of this model.
probabilities <- predict(model, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities > 0.5)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2931937
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE Sum
## FALSE 0.7068063 0.7068063
## TRUE 0.2931937 0.2931937
## Sum 1.0000000 1.0000000
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K = 10)
cv$delta[1]
## [1] 0.3010471
The average proportion of wrong predictions in the 10-fold cross-validation is about 0.30, which is bigger than the one in the DataCamp exercise.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. The dataset is small, with only 506 cases and 14 attributes per case, listed in the str() output above.
All variables are numerical, but not every variable is continuous (for example, “chas” is a binary indicator).
library(tidyverse)
library(corrplot)
corr_matrix <- cor(Boston)
corrplot(corr_matrix, method = 'square', type = 'upper', cl.pos = 'b', tl.cex = 0.6)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Because the pairs plot is very complex this time, I decided to explore a corrplot first. To make this plot, I calculated the correlation matrix and then visualised it with the function “corrplot” from the package “corrplot”.
What I notice from this plot is that several pairs of variables are strongly correlated.
For example, we can take a look at the relationship between “rad” and “tax”, which have a very strong positive correlation.
ggplot(data = Boston, aes(x = rad, y = tax)) +
geom_point()
I think we just found the reason for such a strong correlation: the group of observations in the upper right corner might be driving it. We can test what happens if we remove those points.
library(dplyr)
cor_test <- as.data.frame(cbind(Boston$rad, Boston$tax))
cor_test <- filter(cor_test, V1 < 20)
cor(cor_test)
## V1 V2
## V1 1.0000000 0.1882562
## V2 0.1882562 1.0000000
After removing these observations, the correlation drops to about 0.19. So my hypothesis about why their correlation is so strong was right.
To go on, we need to standardise, or scale, our dataset. This is easily done with the function “scale”.
b_scaled <- scale(Boston)
summary(b_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
b_scaled <- as.data.frame(b_scaled)
Scaled variables have mean 0 and standard deviation 1: scaling divides each centred column by its standard deviation, so all variances become 1. I also changed our dataset from a matrix to a data frame, which will be useful later.
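A quick self-contained check of what scale() does, on a toy matrix (the same holds for b_scaled):

```r
set.seed(2017)
m <- matrix(rnorm(50, mean = 10, sd = 3), ncol = 5)
m_scaled <- scale(m)               # subtract column means, divide by column sds
round(colMeans(m_scaled), 10)      # all zero
round(apply(m_scaled, 2, sd), 10)  # all one
```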
Now we create a categorical variable from the crime rate. We are going to use the quantiles as the break points of this variable.
library(dplyr)
breaks <- quantile(b_scaled$crim)
crime <- cut(b_scaled$crim, breaks = breaks, include.lowest = TRUE, labels = c('low', 'med_low', 'med_high', 'high'))
b_scaled <- b_scaled[,2:14]
b_scaled <- data.frame(b_scaled, crime)
With the last lines of code, I created the categorical variable “crime” using the quantiles as break points, deleted the original “crim” column, and added the new “crime” column to the data frame.
For our future analysis, we need to divide our data into train and test sets. We are going to put 80 per cent of the data into the train set.
ind <- sample(nrow(b_scaled), size = nrow(b_scaled)*0.8)
train <- b_scaled[ind,]
test <- b_scaled[-ind,]
correct_classes <- test$crime
test <- test[,1:13]
We are going to fit a linear discriminant analysis on the train set, using the categorical crime rate as the target variable and all the other variables in the dataset as predictors.
b_lda <- lda(crime ~ ., data = train)
The LDA plot is below.
b_arrows <- function(x, myscale = 1, arrow.heads = 0.1, color = 'pink', tex = 0.75, choices = c(1,2)) {
heads = coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale*heads[,choices[1]],
y1 = myscale*heads[,choices[2]],
col = color, length = arrow.heads)
text(myscale*heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}
classes = as.numeric(train$crime)
plot(b_lda, dimen = 2, col = classes)
b_arrows(b_lda, myscale = 2)
Then we are going to predict the classes with the LDA model on the test data.
b_predict <- predict(b_lda, newdata = test)
t <- table(correct = correct_classes, predicted = b_predict$class)
t
## predicted
## correct low med_low med_high high
## low 12 11 4 0
## med_low 5 15 5 0
## med_high 1 2 23 3
## high 0 0 0 21
I think the model predicted the class “high” best, maybe because this class differs a lot from the other classes. To see in which class we have the most errors, we can create the same table with percentages.
library(tidyverse)
c1 <- t[,1]/colSums(t)[1]
c2 <- t[,2]/colSums(t)[2]
c3 <- t[,3]/colSums(t)[3]
c4 <- t[,4]/colSums(t)[4]
t2 <- cbind(c1, c2, c3, c4)
colnames(t2) <- c('low', 'med_low', 'med_high', 'high')
t2
## low med_low med_high high
## low 0.66666667 0.39285714 0.12500 0.000
## med_low 0.27777778 0.53571429 0.15625 0.000
## med_high 0.05555556 0.07142857 0.71875 0.125
## high 0.00000000 0.00000000 0.00000 0.875
The prediction of the class “med_low” caused the biggest number of wrong predictions, while the predictions of the class “high” were the most correct.
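From the confusion table we can also compute the overall accuracy; a sketch with the counts copied from the table above (note that the train/test split is random, so a re-run gives slightly different numbers):

```r
# Rows: correct class, columns: predicted class (counts from the table above)
conf <- matrix(c(12,  5,  1,  0,   # predicted low
                 11, 15,  2,  0,   # predicted med_low
                  4,  5, 23,  0,   # predicted med_high
                  0,  0,  3, 21),  # predicted high
               ncol = 4,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
acc <- sum(diag(conf)) / sum(conf)  # proportion classified correctly, about 0.70
acc
```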
Now we are going to do k-means clustering on the “Boston” data. First of all, we have to decide how many clusters we should have. We can do this by looking at how the total within-cluster sum of squares (WCSS) changes as the number of clusters grows.
data(Boston)
b_scaled <- scale(Boston)
dist <- dist(b_scaled)
set.seed(123)
k_max <- 12
twscc <- sapply(1:k_max, function(k){
kmeans(b_scaled, k)$tot.withinss
})
qplot(x = 1:12, y = twscc, geom = 'line')
I think 2 is the optimal number of clusters.
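The clustering call itself can then be run as follows; a minimal sketch, assuming the scaled Boston data and the seed from above (km$cluster assigns each observation to one of the 2 clusters, and the variable pairs drawn are just an illustrative choice):

```r
library(MASS)      # for the Boston data
data(Boston)
b_scaled <- scale(Boston)
set.seed(123)
km <- kmeans(b_scaled, centers = 2)
table(km$cluster)  # sizes of the two clusters
# colour a few interesting variable pairs by cluster
pairs(Boston[, c("nox", "indus", "medv")], col = km$cluster)
```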
Let’s do the k-means clustering with 2 clusters and interpret the results. To be honest, it is not easy to interpret them. Anyway, here are my thoughts:
Based on these thoughts, cluster 2 includes less attractive houses (bad air, located next to factories). The houses in cluster 1 are located in better areas, the air is better, and that is why their price is usually higher.
Let us first take a look at the data: what variables it has and what kind of variables they are. We will also check whether there are any correlations between the variables.
human <- read.csv('human.csv',row.names = 1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ eduFM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ labFM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ ExpEdu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ ExpLife : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ MMR : int 4 6 6 5 6 7 9 28 11 8 ...
## $ BirthRate: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ PP : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
The variables stand for:
library(dplyr)
library(ggplot2)
library(GGally)
ggpairs(human)
cor(human) %>% corrplot::corrplot()
We notice a strong negative correlation between MMR and ExpLife: the higher the maternal mortality ratio, the shorter the life expectancy. Also, ExpEdu (expected years of schooling) and ExpLife (life expectancy at birth) have a strong positive correlation.
Now it is time to perform PCA. I am going to do it on the non-standardized data first.
pca_non_sta <- prcomp(human)
summary(pca_non_sta)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
biplot(pca_non_sta, choices = 1:2)
The results we got don’t make sense at all. The reason is that our data is not standardized: PCA works on variances, and the GNI variable is on a scale of tens of thousands while the other variables are single or double digits, so GNI dominates the first principal component.
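This effect can be demonstrated with a toy example (hypothetical data, not the human dataset): when one variable has a much larger scale than the rest, it hijacks the first principal component, and standardizing fixes this:

```r
set.seed(1)
toy <- data.frame(small = rnorm(100),              # sd 1
                  big   = rnorm(100, sd = 10000))  # sd 10000
p_raw <- prcomp(toy)
summary(p_raw)$importance["Proportion of Variance", "PC1"]  # essentially 1
p_std <- prcomp(toy, scale. = TRUE)
summary(p_std)$importance["Proportion of Variance", "PC1"]  # about 0.5
```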
Let us do PCA on the standardized data this time.
human <- scale(human) # standardizing
pca_sta <- prcomp(human)
summary(pca_sta)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
biplot(pca_sta, choices = 1:2, cex = c(0.5, 1), col = c("pink", "black"))
The analysis on the standardized data produces more realistic and more useful results than the previous one did.
A high maternal mortality ratio and a high adolescent birth rate are typical signs of a developing country. On the other hand, high education, long life expectancy and high GNI are often associated with developed countries. So the first principal component measures how developed a country is.
The second component has a strong positive correlation with PP and labFM, where PP stands for the percentage of female representatives in parliament and labFM for the female proportion in the labour force compared to the male one. I would say that this component shows how equal women are compared to men.
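Such interpretations can be verified from the loadings, i.e. the rotation matrix. A self-contained sketch on the built-in USArrests data; the same idea applies to pca_sta$rotation:

```r
# Each column of $rotation is a principal component; the signs and
# magnitudes of its entries show which variables drive that component
p <- prcomp(USArrests, scale. = TRUE)
round(p$rotation[, 1:2], 2)
```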
Now we are going to do MCA on the tea dataset.
# download "tea"
library(FactoMineR)
library(tidyr)
data(tea)
tea <- dplyr::select(tea, one_of(c("tea.time", "tearoom", "Tea","age_Q", "sex")))
tea <- filter(tea, complete.cases(tea))
gather(tea) %>% ggplot(aes(value)) +
facet_wrap("key", scales = "free") +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size= 9))
mca <- MCA(tea, graph = FALSE)
plot(mca, invisible = c("ind"))
What I notice is that women are more likely to have tea time and to visit tearooms than men. Also, black tea is the most popular among people aged 60+. Young people between 25 and 34 don’t usually have any tea time.
In this part, we are going to analyze the longitudinal data with the help of different graphs. We will not use any linear mixed-effects models.
First of all, let us download the data and explore it.
RATS <- read.csv("data/rats.csv")
RATS_o <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
Our data is from a nutrition study conducted on three groups of rats. The groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly, approximately weekly (except in week seven, when two recordings were taken), over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.
To begin, we will plot the weight values for all rats, differentiating between the diet groups into which the rats were randomized.
library(ggplot2)
ggplot(RATS, aes(x=time, y=weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times = 2)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = 'none') +
scale_y_continuous(limits = c(min(RATS$weight), max(RATS$weight)))
We notice that the initial weights of the rats in the 1st group are a lot smaller than those of the rats in the other groups, and we might suggest that the diets of the 2nd and 3rd groups are better for weight gain.
Also, one rat in the second group seems to be an outlier. We will check later if it should be removed.
Next, we will take a look at the plot which contains means and standard error of means.
library(tidyr)
library(dplyr)
n <- RATS$time %>% unique() %>% length()
RATS_means <- RATS %>%
group_by(Group, time) %>%
summarise( mean = mean(weight), se = sd(weight)/sqrt(n) ) %>%
ungroup()
ggplot(RATS_means, aes(x = time, y = mean, linetype = as.factor(Group), shape = as.factor(Group))) +
geom_line() +
geom_point(size=2) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.5)) +
scale_y_continuous(name = "mean(weight) +/- se(weight)")
Looking at this graph, I have the same feeling as previously. The rats on the 2nd and 3rd diets gained more weight than the rats that had the 1st diet. But we have to notice that the standard errors of the means are a lot smaller in the 1st group.
We noticed one outlier. Let us check if it should be removed.
RATS_out <- RATS %>%
filter(time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(weight) ) %>%
ungroup()
ggplot(RATS_out, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(weight), weeks 1-8")
We notice one outlier that differs significantly from the others. I think that we can remove it.
RATS_t <- RATS_out %>%
filter(mean < 560)
Now it is time to perform an ANOVA test, adding the baseline weights from the original wide-form data as a covariate.
RATS_anova <- RATS_out %>%
mutate(baseline = RATS_o$WD1)
fit <- lm(mean ~ baseline + Group, data = RATS_anova)
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We notice that the baseline weight is strongly related to the weights recorded after the diets started. There is no evidence of a diet difference after conditioning on the baseline value, which means we cannot say that there is a difference between these diets.
In this part, we are going to use different linear mixed-effects models to analyze the BPRS data.
We are going to download and explore our data now.
DATA <- read.csv("data/bprs.csv")
The data contains measurements of 40 male subjects who were randomly assigned to one of two treatment groups; each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The scale is used to evaluate patients suspected of having schizophrenia.
We want to know if there is any difference between these treatments.
str(DATA)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
DATA$treatment <- factor(DATA$treatment)
DATA$subject <- factor(DATA$subject)
ggplot(DATA, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(DATA$bprs), max(DATA$bprs)))
We are going to create a few models now. We will choose the best one.
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
model_1 <- lmer(bprs ~ week + treatment + (1|subject), data = DATA, REML = FALSE)
model_2 <- lmer(bprs ~ week + treatment + (week | subject), data = DATA, REML = FALSE)
model_3 <- lmer(bprs ~ week + treatment + (week | subject) + week*treatment, data = DATA, REML = FALSE)
anova(model_1, model_2, model_3)
## Data: DATA
## Models:
## model_1: bprs ~ week + treatment + (1 | subject)
## model_2: bprs ~ week + treatment + (week | subject)
## model_3: bprs ~ week + treatment + (week | subject) + week * treatment
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## model_1 5 2748.7 2768.1 -1369.4 2738.7
## model_2 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## model_3 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test p-value is smallest for “model_2”: the random slope improves significantly on the random intercept model (p = 0.026), while adding the week-by-treatment interaction in “model_3” does not bring a significant further improvement (p = 0.075). So we are going to use model_2.
summary(model_2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: DATA
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7977
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9803 -0.51
## Residual 97.4304 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
The t value of “treatment2” is only 0.550, which suggests that there is no significant difference between the 1st and 2nd treatments.